extend genes in complete genomes to start/stop codons #5

ninewise · 2021-09-16T20:37:11Z

Fixes #3

ninewise · 2021-09-17T09:58:40Z

It would seem a simple T replaced by an A can have consequences.

The main difference remaining between the FGSrs and Prodigal output is now the exclusion of the reverse stop codon in the reverse strands by FGSrs, where Prodigal includes this. Which would be the expected behaviour?

pdawyndt · 2021-09-17T10:09:38Z

For me, the expected behaviour is uniform treatment of stop codons for proteins on forward and reverse strands.

jianshu93 · 2021-09-17T15:13:26Z

I agree with this. Now FGSrs is inconsistent for + and - in terms of include stop/start codon or not compare to prodigal

Jianshu

jianshu93 · 2021-09-18T13:43:59Z

a real world application using gene prediction： mapping transriptomic reads（pair ends sequencing）to each gene to see whether there are reads mapped to antisense（reverse strand of gene strand）showed that for reverse reads，the proportion of them mapped to the reverse strand of gene strand is a lot higher than forward reads for FGSrs but ver close for prodigal. So I would think now FGSrs reverse gene position is not so perfect.

Many thanks！

Jianshu

jianshu93 · 2021-09-20T02:20:44Z

NTO_MAG_1_prodigal.new.gff.zip
NTO_MAG_1_gene.gff.zip

Hello, I tried the newest fix 02cc0fe. Now start and stop codons look good. But for some genes that are not consistent, FGSrs is always short than prodigal (large start position, same stop position). Others looks good.

Jianshu

tool TP FP TN FN prec sens spec NPV MCC FGS 2494195 231257 656879 283010 91.51 89.81 73.96 69.89 62.58 FGS+ 2661256 533556 361341 109188 83.3 96.06 40.38 76.79 46.79 prodigal 2488137 117535 745673 313996 95.49 88.79 86.38 70.37 70.36 FGSrs 1917354 383036 508737 856214 83.35 69.13 57.05 37.27 23.23

tool TP FP TN FN prec sens spec NPV MCC FGS 2494195 231257 656879 283010 91.51 89.81 73.96 69.89 62.58 FGS+ 2661256 533556 361341 109188 83.3 96.06 40.38 76.79 46.79 prodigal 2488137 117535 745673 313996 95.49 88.79 86.38 70.37 70.36 FGSrs 2601595 527580 367165 169001 83.14 93.9 41.04 68.48 42.47

tool TP FP TN FN prec sens spec NPV MCC FGS 2494195 231257 656879 283010 91.51 89.81 73.96 69.89 62.58 FGS+ 2661256 533556 361341 109188 83.3 96.06 40.38 76.79 46.79 prodigal 2488137 117535 745673 313996 95.49 88.79 86.38 70.37 70.36 FGSrs 2650006 517695 371428 126212 83.66 95.45 41.77 74.64 46.59

tool TP FP TN FN prec sens spec NPV MCC FGS 2494195 231257 656879 283010 91.51 89.81 73.96 69.89 62.58 FGS+ 2661256 533556 361341 109188 83.3 96.06 40.38 76.79 46.79 prodigal 2488137 117535 745673 313996 95.49 88.79 86.38 70.37 70.36 FGSrs 2652412 517860 371290 123779 83.67 95.54 41.76 75.0 46.78

Ratings: tool TP FP TN FN prec sens spec NPV MCC FGS 2494195 231257 656879 283010 91.51 89.81 73.96 69.89 62.58 FGS+ 2661256 533556 361341 109188 83.3 96.06 40.38 76.79 46.79 prodigal 2488137 117535 745673 313996 95.49 88.79 86.38 70.37 70.36 FGSrs 2661082 518282 368500 117477 83.7 95.77 41.55 75.83 47.14

tool TP FP TN FN prec sens spec NPV MCC FGS 2494195 231257 656879 283010 91.51 89.81 73.96 69.89 62.58 FGS+ 2661256 533556 361341 109188 83.3 96.06 40.38 76.79 46.79 prodigal 2488137 117535 745673 313996 95.49 88.79 86.38 70.37 70.36 FGSrs 2661082 518282 368500 117477 83.7 95.77 41.55 75.83 47.14

tool TP FP TN FN prec sens spec NPV MCC FGS 2494195 231257 656879 283010 91.51 89.81 73.96 69.89 62.58 FGS+ 2661256 533556 361341 109188 83.3 96.06 40.38 76.79 46.79 prodigal 2488137 117535 745673 313996 95.49 88.79 86.38 70.37 70.36 FGSrs 2504714 280484 607613 272530 89.93 90.19 68.42 69.04 58.78

tool TP FP TN FN prec sens spec NPV MCC FGS 2494195 231257 656879 283010 91.51 89.81 73.96 69.89 62.58 FGS+ 2661256 533556 361341 109188 83.3 96.06 40.38 76.79 46.79 prodigal 2488137 117535 745673 313996 95.49 88.79 86.38 70.37 70.36 FGSrs 2494195 231257 656879 283010 91.51 89.81 73.96 69.89 62.58

jianshu93 · 2021-09-22T16:54:39Z

Hello This time it improves by a lot. The ffn and faa output positions in the head is consistent with gff output but stop and start codon still not included in ffn and faa sequence.

Jianshu

jianshu93 · 2021-09-22T19:32:22Z

More than that. The ffn and faa sequences (head is correct) output is not consistent with the prodigal fna and faa output even though the gff output is the same with prodigal gff output (output ffn part not fixed according to new gene positions). Everything else looks good. Many thanks for all those fixes and I think this should be the last one then FGSrs should be almost consistent with prodigal. Thanks again and good luck with the manuscript! You could try bmc bioinformatics.

If you would like any testing and comparisons with prodigal/metaGeneMark,I will be very happy to test and by the way Mark Borodosky is my co-advisor here at Georgia tech. Any thing related to model improvement, he may be the one to give some advice.
Many thanks,
Jianshu.

ninewise · 2021-09-23T13:33:39Z

And that last commit should finish the PR. You might have noted the scripts in the meta/evaluation directory. These have helped me to check the quality of the predictions (given a complete genome with annotations from ENA). The rates script reports below table (true positives the number of bp in a prediction on the correct strand, false positives the number of bp in a prediction outside annotations or on the wrong strand, true negatives the number of bp outside annotations which weren't in any prediction and false negatives the number of bp inside annotations which weren't in any prediction; and prec(ision), sens(itivity), spec(ificity), Negative Predictive Value and Matthew's Correlation Coefficient in percentages).

tool	TP	FP	TN	FN	prec	sens	spec	NPV	MCC
FGS	2494195	231257	656879	283010	91.51	89.81	73.96	69.89	62.58
FGS+	2661256	533556	361341	109188	83.3	96.06	40.38	76.79	46.79
prodigal	2488137	117535	745673	313996	95.49	88.79	86.38	70.37	70.36
FGSrs	2494195	231257	656879	283010	91.51	89.81	73.96	69.89	62.58

Seems like in quality, prodigal still outperforms FragGeneScan(Rs). On the other hand, FGSrs is 37.3 times faster for the used dataset (see meta/evaluation/README.md).

We weren't really working on model improvement for now, just speed and code quality while replicating FGS behaviour, but this might be interesting in the future.

ninewise · 2021-09-23T13:44:33Z

@jianshu93 I'd be interested in a final review from you, too, but I can't seem to request it via the interface.

jianshu93 · 2021-09-25T04:18:21Z

Hello Felix,

I had a look at the evaluation and your benchmarking. I do not see anything wrong this time. My testings are nice considering output. I think the genome you are using are complete genomes right? prodigal actually use a different model for complete genomes. The reason why I use (-p meta) in prodigal is that my genome is not complete genome (there are gaps, not circular because they are assembly with short reads from metagenomes, not cultivated single genome). I think you would also want to try no meta to compare even though the complete model also works for contigs/assemblies (not circular) according to the original FGS description. Prodigal is a little bit different for complete and meta mode. See the paper here: https://academic.oup.com/bioinformatics/article/28/17/2223/246063?login=true

I think in the future FGSrs should also differentiate between complete genome and genome with gaps or assemblies.

Jianshu

jianshu93 · 2021-09-25T04:31:56Z

Hello Felix, when all those commits are merged, please also create a condo recipe for FGSrs. A lot of softwares like the Metagenomics binning software MaxBin are now using FGS, which is very slow for large metagenomes. This could be a greate improvement

Jianshu

jianshu93 · 2021-09-26T00:32:32Z

Hello Felix,

Another error I notice:

thread '' panicked at 'attempt to divide by zero', src/dna.rs:644:21
note: run with RUST_BACKTRACE=1 environment variable to display a backtrace

For some contig (less than 1000 bp, can reach 400 bp) run with compete mode there will always be this error. Any idea why?

Thanks,

Jianshu

ninewise · 2021-09-27T08:33:00Z

That last commit should fix the 'divide by zero'-panick: I was dividing by the length of the sequence without checking whether it was empty. I've also put adding FGSrs to bioconda on the planning. It probably shouldn't be hard, since I can just follow the FGS recipe.

Could you perhaps suggest a good contig/assembly for running this test on? Many aren't properly annotated, and without annotations I can't make the same evaluation. (Unless I assume prodigal is always correct, and compare FGSrs to prodigal, not to the annotations.)

ninewise · 2021-10-06T08:59:45Z

I've merged this PR since it solves some major problems users may currently be encountering. I've created issue #7 to track Bioconda and issue #8 to track the additional benchmark data.

jianshu93 · 2021-10-29T14:33:15Z

Hello Felix,

Sorry for such late response. I ran a test and this version is ok. I am wondering whether we can implement a fast AAI (average amino acid identity) module following FGSrs. which is to find best AA gene of one genome in another genome and also reverse. If the 2 are reciprocal best hit and then we keep this pair and do it for all genes in one genome and another genome and keep all reciprocal best hits and calculate average AAI based on all those genes. Seems search will take some time for each gene, diamond blastp could be better than we implement local alignment ourselves based on rust-bio.

do you have better idea if we do not rely on external database search software such as diamond to calculate AAI?

Many thanks,

Jianshu

ninewise · 2021-10-29T14:47:55Z

I'm afraid I'm currently finishing my PhD, so I don't really have time to look into tangents at the moment. This does sound like an interesting idea though.

jianshu93 · 2021-10-29T15:40:03Z

Hello Felix,

Good luck with you Ph.D! It's glad to hear that you are finishing it! I still have 3 years to go! Anyhow, I will implement it myself, possibly rely on blast+ suite to do best hit searching.

Jianshu

jianshu93 · 2021-10-31T19:27:04Z

Hello Felix,

Can you also add a support for gzipped fasta file as input? Should be very easy to implement? Or is is already supported? Think about when we have huge number of genomes downloaded from NCBI, which are all in *.fna.gz format.

Thanks,

Jianshu

ninewise marked this pull request as draft September 16, 2021 20:37

ninewise force-pushed the fix/complete-start branch 2 times, most recently from bdb2aa0 to ec569a9 Compare September 16, 2021 21:15

ninewise marked this pull request as ready for review September 17, 2021 10:01

Felix Van der Jeugt added 9 commits September 21, 2021 14:02

ninewise force-pushed the fix/complete-start branch from 02cc0fe to 22f2f20 Compare September 22, 2021 15:12

Felix Van der Jeugt added 2 commits September 22, 2021 17:55

invert s_save to avoid negative numbers

692c10c

ninewise force-pushed the fix/complete-start branch from d5c451a to 2a3ebcd Compare September 22, 2021 15:56

Felix Van der Jeugt added 3 commits September 23, 2021 00:41

add quality measuring scripts

694db51

add last nights benchmarks to the evaluation README

b5bdd51

include start and stop codons in output

7eac426

ninewise requested a review from pdawyndt September 23, 2021 13:42

Felix Van der Jeugt added 2 commits September 29, 2021 14:04

allow empty sequences

b2fa61f

run prodigal without meta option

437ef43

ninewise force-pushed the fix/complete-start branch from 87cef70 to 437ef43 Compare September 29, 2021 12:06

ninewise merged commit 2757c4e into main Oct 6, 2021

ninewise mentioned this pull request Oct 31, 2021

Add gzipped input support #9

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

extend genes in complete genomes to start/stop codons #5

extend genes in complete genomes to start/stop codons #5

ninewise commented Sep 16, 2021

ninewise commented Sep 17, 2021

pdawyndt commented Sep 17, 2021 •

edited

Loading

jianshu93 commented Sep 17, 2021

jianshu93 commented Sep 18, 2021

jianshu93 commented Sep 20, 2021

jianshu93 commented Sep 22, 2021

jianshu93 commented Sep 22, 2021 •

edited

Loading

ninewise commented Sep 23, 2021 •

edited

Loading

ninewise commented Sep 23, 2021

jianshu93 commented Sep 25, 2021

jianshu93 commented Sep 25, 2021

jianshu93 commented Sep 26, 2021 •

edited

Loading

ninewise commented Sep 27, 2021

ninewise commented Oct 6, 2021

jianshu93 commented Oct 29, 2021

ninewise commented Oct 29, 2021

jianshu93 commented Oct 29, 2021

jianshu93 commented Oct 31, 2021

extend genes in complete genomes to start/stop codons #5

extend genes in complete genomes to start/stop codons #5

Conversation

ninewise commented Sep 16, 2021

ninewise commented Sep 17, 2021

pdawyndt commented Sep 17, 2021 • edited Loading

jianshu93 commented Sep 17, 2021

jianshu93 commented Sep 18, 2021

jianshu93 commented Sep 20, 2021

jianshu93 commented Sep 22, 2021

jianshu93 commented Sep 22, 2021 • edited Loading

ninewise commented Sep 23, 2021 • edited Loading

ninewise commented Sep 23, 2021

jianshu93 commented Sep 25, 2021

jianshu93 commented Sep 25, 2021

jianshu93 commented Sep 26, 2021 • edited Loading

ninewise commented Sep 27, 2021

ninewise commented Oct 6, 2021

jianshu93 commented Oct 29, 2021

ninewise commented Oct 29, 2021

jianshu93 commented Oct 29, 2021

jianshu93 commented Oct 31, 2021

pdawyndt commented Sep 17, 2021 •

edited

Loading

jianshu93 commented Sep 22, 2021 •

edited

Loading

ninewise commented Sep 23, 2021 •

edited

Loading

jianshu93 commented Sep 26, 2021 •

edited

Loading